Abstract. Parallel corpora are a valuable resource for machine translation,
but at present their availability and utility are limited by genre- and
domain-specificity, licensing restrictions, and the basic difficulty of
locating parallel texts in all but the most dominant of the world’s languages.
A parallel corpus resource not yet explored is the World Wide Web, which hosts
an abundance of pages in parallel translation, offering a potential solution
to some of these problems and unique opportunities of its own. This paper
presents the necessary first step in that exploration: a method for
automatically finding parallel translated documents on the Web. The technique
is conceptually simple, fully language independent, and scalable, and
preliminary evaluation results indicate that the method may be accurate enough
to apply without human intervention.


1  Introduction 

In recent years large parallel corpora have taken on an important role as
resources in machine translation and multilingual natural language processing,
for such purposes as lexical acquisition (e.g. Gale and Church, 1991a;
Melamed, 1997), statistical translation models (e.g. Brown et al., 1990;
Melamed, 1998), and cross-language information retrieval (e.g. Davis and
Dunning, 1995; Landauer and Littman, 1990; also see Oard, 1997). However, for
all but relatively few language pairs, parallel corpora are available only in
relatively specialized forms such as United Nations proceedings (LDC, 1996),
religious texts (Resnik, Olsen, and Diab, 1998), and localized versions of
software manuals (Resnik and Melamed, 1997). Even for the top dozen or so
majority languages, the available parallel corpora tend to be unbalanced,
representing primarily governmental and newswire-style texts. In addition,
like other language resources, parallel corpora are often encumbered by fees
or licensing restrictions. For all these reasons, following the “more data are
better data” advice of Church and Mercer (1993), abandoning balance in favor
of volume, is difficult.

A parallel corpus resource not yet explored is the World Wide Web, which
hosts an abundance of pages in parallel translation, offering a potential
solution to some of these problems and some unique opportunities of its own.
The Web contains parallel pages in many languages, by innumerable authors, in
multiple genres and domains, and its content is continually enriched by
language change and modified by cultural context. In this paper I will not
attempt to explore whether such a free-wheeling source of linguistic content
is better or worse than the more controlled parallel corpora in use today.


Fig. 1. The STRAND architecture

Rather, this paper presents the necessary first step in that exploration: a
method for automatically finding parallel translated documents on the Web
that I call STRAND (Structural Translation Recognition for Acquiring Natural
Data). The technique is conceptually simple, fully language independent, and
scalable, and preliminary evaluation results indicate that the method may be
accurate enough to apply without human intervention.

In  Section  2  I  lay  out  the  STRAND  architecture  and  describe  in  detail  the 
core  of  the  method,  a  language-independent  structurally  based  algorithm  for 
assessing  whether  or  not  two  Web  pages  were  intended  to  be  parallel  translations. 
Section  3  presents  preliminary  evaluation,  and  Section  4  discusses  future  work. 

2  The  STRAND  Architecture 

As Figure 1 illustrates, the STRAND architecture is a simple pipeline. Given
a particular pair of languages of interest, a candidate generation module
first generates pairs (url1, url2) identifying World Wide Web pages that may
be parallel translations.^ Next, a language-independent candidate evaluation
module behaves as a filter, keeping only those candidate pairs that are likely
to actually be translations. Optionally, a third module for language-dependent
filtering applies additional filtering criteria that might depend upon
language-specific resources. The end result is a set of candidate pairs that
can reliably be added to the Web-based parallel corpus for these two languages.
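
As an illustration only, the three-stage pipeline just described might be
wired together as in the following Python sketch. The function and parameter
names (strand_pipeline, evaluate, language_filter) are hypothetical and do not
come from the prototype; the two filter stages are passed in as callables so
that the optional third stage can simply be omitted.

```python
def strand_pipeline(candidates, evaluate, language_filter=None):
    """Yield the (url1, url2) candidates that survive both filter stages.

    candidates      -- iterable of (url1, url2) pairs from candidate generation
    evaluate        -- language-independent test, called as evaluate(url1, url2)
    language_filter -- optional language-dependent test with the same signature
    """
    for url1, url2 in candidates:
        if not evaluate(url1, url2):
            continue  # rejected by candidate evaluation
        if language_filter is not None and not language_filter(url1, url2):
            continue  # rejected by the optional language-dependent stage
        yield (url1, url2)
```

Because each stage only sees a pair of URLs, either filter could be replaced
without touching the other stages.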

^ A URL, or uniform resource locator, is the address of a document or other
resource on the World Wide Web.

The approach to candidate evaluation taken in this paper has a useful side
effect: in assessing the likelihood that two HTML documents are parallel
translations, the module produces a segment-level alignment for the document
pair, where segments are chunks of text appearing in between markup. Thus
STRAND has the potential of producing a segment-aligned parallel corpus rather
than, or in addition to, a document-aligned parallel corpus. In this paper,
however, only the quality of document-level alignment is evaluated.^


2.1  Candidate  Generation 

At present the candidate generation module is implemented very simply. First,
a query is submitted to the Altavista Web search engine, which identifies Web
pages containing at least one hyperlink where ‘language1’ appears in the text
or URL associated with the link, and at least one such link for language2.^
For example, Altavista’s “advanced search” can be given Boolean queries in
this form:

anchor:"language1" AND anchor:"language2"

A query of this kind, using english and french as language1 and language2,
respectively, locates the home page of the Academy of American & British
English, at http://www.academyofenglish.com/ (Figure 2), among many others.

On some pages, images alone are used to identify alternative language
versions: the flag of France linking to a French-language page, for example,
but without the word “French” being visible to the user. Text-based queries
can still locate such pages much of the time, however, because the HTML markup
for the page conventionally includes the name of the language for display by
non-graphical browsers (in the ALT field of the IMG element). Names of
languages sometimes also appear in other parts of a URL; for example, the file
containing the image of the French flag might be named french.gif. The
Altavista query above succeeds in identifying all these cases and numerous
others.

In the second step of candidate generation, each page returned by Altavista
is automatically processed to extract all pairs (url1, url2) appearing in
anchors (a1, a2) such that a1 contains ‘language1’, a2 contains ‘language2’,
and a1 and a2 are no more than 10 lines apart in the HTML source for the page.
This distance criterion captures the fact that for most Web pages that point
off to parallel translations, the links to those translations appear
relatively close together, as is the case in Figure 2.
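
The second step can be sketched roughly as follows. This is a minimal Python
illustration, not the prototype's code: the regular expression is a deliberate
simplification of HTML anchor syntax, the function name candidate_pairs is
hypothetical, and the sketch assumes that a language name counts as present if
it occurs anywhere in the anchor (URL, attributes such as ALT, or link text).

```python
import re
from itertools import product

# Simplified pattern for <a href=...>...</a>; real-world HTML may need a parser.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*\bhref\s*=\s*["\']?([^"\'\s>]+)[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def candidate_pairs(html, language1, language2, max_line_gap=10):
    """Return (url1, url2) pairs whose anchors mention language1 and language2
    and start no more than max_line_gap lines apart in the HTML source."""
    hits1, hits2 = [], []
    for match in ANCHOR_RE.finditer(html):
        line_no = html.count("\n", 0, match.start())  # 0-based source line
        anchor = match.group(0).lower()  # URL, attributes, and link text
        url = match.group(1)
        if language1.lower() in anchor:
            hits1.append((line_no, url))
        if language2.lower() in anchor:
            hits2.append((line_no, url))
    return [
        (url1, url2)
        for (line1, url1), (line2, url2) in product(hits1, hits2)
        if abs(line1 - line2) <= max_line_gap
    ]
```

Matching against the whole anchor is what lets an image-only link, with the
language name in its ALT text or file name, be picked up as well.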

^ HTML, or hypertext markup language, is currently the authoring language for
most Web pages. The STRAND approach should also be applicable to SGML, XML,
and other formats, but they will not be discussed here.

^ An “anchor” is a piece of HTML document that encodes a hypertext link. It
typically includes the URL of the page being linked to and text the user can
click on to go there; it may contain other information, as well. The URL for
Altavista’s “advanced search” page is
http://altavista.digital.com/cgi-bin/query?pg=aq&what=web.

I have not experimented much with variants on this simple method for
candidate generation, and it clearly could be improved in numerous ways to
retrieve a greater number of good candidates. For example, it might make
sense to issue a query seeking documents in language2 with an anchor
containing ‘language1’ (e.g. query Altavista for pages in French containing
pointers to ‘English’, to capture the many pairs connected by a link saying
‘English version’). Or, it might be possible to exploit parallel URL and/or
directory structure; for example, the URLs http://amta98.org/en/program.html
and http://amta98.org/fr/program.html are more likely than other URL pairs to
be referring to parallel pages, and the directory subtrees under en and fr on
the fictitious amta98.org server might be well worth exploring for other
potential candidate pairs.
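
To make the URL-structure idea concrete, here is a hypothetical Python sketch
of one way it could work; it is not part of the prototype. It assumes that the
language appears as a whole path segment (en, fr) and that a set of known URLs
is already available, and the function name parallel_url_pairs is invented for
the example.

```python
from urllib.parse import urlsplit, urlunsplit


def parallel_url_pairs(urls, code1="en", code2="fr"):
    """Pair URLs that are identical except for one path segment,
    which is code1 in the first URL and code2 in the second."""
    known = set(urls)
    pairs = []
    for url in sorted(known):
        parts = urlsplit(url)
        segments = parts.path.split("/")
        for i, segment in enumerate(segments):
            if segment != code1:
                continue
            # Swap this one segment and see whether the resulting URL is known.
            swapped = segments[:i] + [code2] + segments[i + 1:]
            other = urlunsplit(parts._replace(path="/".join(swapped)))
            if other in known:
                pairs.append((url, other))
    return pairs
```

Pairs produced this way would still be only candidates, to be passed through
candidate evaluation like any others.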

For this initial investigation, however, generating a reasonable set of
candidates was the necessary first step, and the simple approach above works
well enough. Alternatives to the current candidate generation module will be
explored in future work.

2.2  Candidate  Evaluation 

The core of the STRAND approach is its method for evaluating candidate pairs,
that is, determining whether two pages should be considered parallel
translations. This method exploits two facts. First, parallel pages are filled
with a great deal of identical HTML markup. Second, work on bilingual text
alignment has established that there is a reliably linear relationship in the
lengths of text translations (Gale and Church, 1991b; Melamed, 1996). The
algorithm works by using pieces of identical markup as reliable points of
correspondence and computing a best alignment of markup and non-markup chunks
between the two documents. It then computes the correlation for the lengths of
the non-markup chunks. A test for the significance of this correlation is used
to decide whether or not a candidate pair should be identified as parallel
text.

Fig. 3. Example of a candidate pair


For example, Figure 3 shows fragments from a pair of pages identified by
STRAND’s candidate generation module in the experiment to be described in
Section 3. An English page is at left, Spanish at right.^ Notice the extent to
which the page layout is parallel, and the way in which corresponding units of
text (list items, for example) have correspondingly greater or smaller lengths.

In more detail, the steps in candidate evaluation are as follows:


1. Linearize. Both documents in the candidate pair are run through a markup
analyzer that acts as a transducer, producing a linear sequence containing
three kinds of token:

[START:element_label]  e.g. [START:A], [START:LI]

[END:element_label]  e.g. [END:A]

[Chunk:length]  e.g. [Chunk:174]

2. Align the linearized sequences. There are many approaches one can take to
aligning sequences of elements. In the current prototype, the Unix sdiff
utility does a fine job of alignment, matching up identical START and END
tokens in the sequence and Chunk tokens of identical length in such a way as
to minimize the differences between the two sequences. For example, consider
two documents that begin as follows:


^  Source:  http://www.legaldatasearch.com/. 


<HTML> 

<TITLE>Emergency  Exit</TITLE> 
<BODY> 

<H1>Emergency  Exit</H1>

If  seated  at  an  exit  and 


<HTML> 

<TITLE>Sortie  de  Secours</TITLE> 
<BODY> 

Si  vous  etes  assis  a 
cote  d’une... 


The aligned linearized sequence would be as follows:^


[START:HTML]     [START:HTML]
[START:TITLE]    [START:TITLE]
[Chunk:12]       [Chunk:15]
[END:TITLE]      [END:TITLE]
[START:BODY]     [START:BODY]
[START:H1]
[Chunk:12]
[END:H1]
[Chunk:112]      [Chunk:122]


3. Threshold the aligned, linearized sequences based on mismatches. When two
pages are not parallel, there is a high proportion of mismatches in the
alignment: sequence tokens on one side that have no corresponding token on
the other side, such as the tokens associated with the H1 element in the
above example. This can happen, for example, when two documents are
translations up to a point, e.g. an introduction, but one document goes on
to include a great deal more content than the other. Even more frequently,
the proportion is high when two documents are prima facie bad candidates
for a translation pair. For these reasons, candidate pairs whose mismatch
proportion exceeds a constant, K, are eliminated at this stage. My current
value for K was set manually at 20% based on experience with a development
set, and that value was frozen and used in the experiment described in the
next section. In that experiment evaluation of STRAND was done using a
different set of previously unseen documents, for a different language pair
than the one used during development.

4. Compute a confidence value. Let (X, Y) = {(x_1, y_1), ..., (x_n, y_n)} be
the lengths for the aligned Chunk tokens in Step 2, such that x_j is not equal
to y_j. (When they are exactly equal, this virtually always means the aligned
segments are not natural language text. If included, these inflate the
correlation coefficient.) For the above alignment this would be
{(12, 15), (112, 122), ...}. Compute the Pearson correlation coefficient
r(X, Y), and compute the significance of that correlation in textbook fashion.
Note that the significance calculation takes the number n of aligned text
segments into account. The resulting p value is used to threshold
significance; using the standard threshold of p < .05 (i.e. 95% confidence
that the correlation would not have been obtained by chance) worked well
during development, and I retained that threshold in the evaluation described
in the section that follows.

^ Note that whitespace is ignored in counting chunk lengths.


Fig. 4. Scatterplots illustrating reliable correlation in lengths of aligned
segments for good translation pairs (left and right), and lack of correlation
for a bad pair (center).
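
Steps 1 to 3 of the procedure above can be approximated in a few lines of
Python. This is a sketch under stated assumptions rather than a
reimplementation of the prototype: html.parser stands in for the markup
analyzer, difflib.SequenceMatcher stands in for the Unix alignment utility,
any two Chunk tokens are allowed to align regardless of length, and the
mismatch proportion is taken to be unmatched tokens divided by the number of
rows in the alignment (the text does not spell out the denominator). Where
several alignments are equally good, SequenceMatcher may pair chunks
differently from the example shown above.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Step 1: turn an HTML document into START, END, and Chunk tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("START", tag.upper()))

    def handle_endtag(self, tag):
        self.tokens.append(("END", tag.upper()))

    def handle_data(self, data):
        length = len("".join(data.split()))  # whitespace is not counted
        if length:
            self.tokens.append(("Chunk", length))


def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def align(tokens1, tokens2):
    """Steps 2 and 3: return (aligned chunk-length pairs, mismatch proportion)."""
    def key(token):
        # Assumption: any two Chunk tokens may align, whatever their lengths.
        return ("Chunk", None) if token[0] == "Chunk" else token

    matcher = SequenceMatcher(None, [key(t) for t in tokens1],
                              [key(t) for t in tokens2], autojunk=False)
    chunk_pairs, matched = [], 0
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            t1, t2 = tokens1[block.a + k], tokens2[block.b + k]
            matched += 1
            if t1[0] == "Chunk":
                chunk_pairs.append((t1[1], t2[1]))
    unmatched = (len(tokens1) - matched) + (len(tokens2) - matched)
    rows = matched + unmatched  # one row per aligned pair or unmatched token
    return chunk_pairs, (unmatched / rows if rows else 0.0)


def passes_mismatch_threshold(tokens1, tokens2, k=0.20):
    """Step 3: keep the pair only if the mismatch proportion does not exceed K."""
    return align(tokens1, tokens2)[1] <= k
```

The chunk-length pairs returned by align are the input to Step 4.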

Figure 4 shows plots of (X, Y) for three real candidate pairs. At left is the
pair illustrated in Figure 3, correctly accepted by the candidate evaluation
module with r = .99, p < .001. At center is a pair correctly rejected by
candidate evaluation; in this case r = .24, p > .4, and the mismatch
proportion exceeds 75%. And at right is another pair correctly accepted; in
this more unusual case, the correlation is lower (r = .57) but statistically
very reliable because of the large number of data points (p < .0005).
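
The confidence computation of Step 4 can be sketched as follows. The text
says only that significance is computed "in textbook fashion", so the specific
test here is an assumption: this sketch uses the Fisher z approximation with a
two-sided normal p value, which needs only the Python standard library, where
the prototype may well have used a t test instead. Requiring a positive
correlation in is_parallel is likewise an assumption, and the function names
are hypothetical.

```python
import math
from statistics import NormalDist


def length_correlation(chunk_pairs):
    """Return (r, p, n) for aligned chunk lengths, or None if undefined.

    Pairs with identical lengths are dropped first, as described in Step 4.
    """
    pairs = [(x, y) for x, y in chunk_pairs if x != y]
    n = len(pairs)
    if n < 4:
        return None  # too few points for the Fisher z approximation
    xs, ys = zip(*pairs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined if one side is constant
    r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))
    if abs(r) == 1.0:
        return r, 0.0, n
    z = math.atanh(r) * math.sqrt(n - 3)  # approximately standard normal
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return r, p, n


def is_parallel(chunk_pairs, alpha=0.05):
    """Accept a pair when the length correlation is positive and significant."""
    result = length_correlation(chunk_pairs)
    return result is not None and result[0] > 0 and result[1] < alpha
```

Because n enters the test statistic, a moderate r over many segments can be
more significant than a high r over a handful, which is the situation
described for the right-hand plot in Figure 4.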

Notice  that  a  by-product  of  this  structurally-driven  candidate  evaluation 
scheme  is  a  set  of  aligned  Chunk  tokens.  These  correspond  to  aligned  non-markup 
segments  in  the  document  pair.  Evaluating  the  accuracy  of  this  segment-level 
alignment  is  left  for  future  work. 

2.3  Language-Dependent  Filtering 

I have not experimented with further filtering of candidate pairs since, as
shown in the next section, precision is already quite high. However,
experience with the small number of false positives I have seen suggests that
automatic language identification on the remaining candidate pairs might weed
out the few that remain. Very high accuracy language identification using
character n-gram models requires only a modest amount of training text known
to be in the languages of interest (Dunning, 1994; Grefenstette, 1995).
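
For readers unfamiliar with character n-gram language identification, the
following toy Python sketch conveys the general idea. It is not the method of
either cited paper and was not used in this work: it scores text by
add-one-smoothed character trigram log-probabilities, and the function names
and the choice of trigrams are assumptions made for the example. A usable
identifier would be trained on far more text than a single sentence per
language.

```python
import math
from collections import Counter


def trigrams(text):
    """Character trigrams over lowercased, whitespace-normalized text."""
    padded = "  " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """samples maps a language name to training text in that language."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def identify(text, models):
    """Return the language whose trigram model gives text the highest score."""
    vocab = len(set().union(*models.values())) + 1  # +1 for unseen trigrams

    def score(counts):
        total = sum(counts.values())
        # Add-one smoothing keeps unseen trigrams from zeroing the score.
        return sum(math.log((counts[g] + 1) / (total + vocab))
                   for g in trigrams(text))

    return max(models, key=lambda lang: score(models[lang]))
```

In a language-dependent filtering stage, a candidate pair would be kept only
if each page is identified as the language it is supposed to be in.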

3  Evaluation 

I developed the STRAND prototype using English and French as the relevant
pair of languages. For evaluation I froze the code and all parameters and ran
the prototype for English and Spanish, not having previously looked at
English/Spanish pairings on the Web.

For the candidate generation phase, I followed the approach of Section 2.1
and generated candidate document pairs from the first 200 hits returned by the
Altavista search engine, leading to a set of 198 candidate pairs of URLs that
met the distance criterion.

Of those 198 candidate pairs, 12 were pairs where url1 and url2 pointed to
identical pages, and so these were eliminated from consideration. In 96 cases
one or both pages in the pair could not be retrieved (page not found, moved,
empty, server unreachable, etc.). The remaining 90 cases are considered the
set of candidate pairs for evaluation.

I evaluated the 90 candidate pairs by hand, determining that 24 represented
true translation pairs.^ The criterion for this determination was the
question: Was this pair of pages intended to provide the same content in the
two different languages? Although admittedly subjective, the judgments are
generally quite clear; I include URLs in an on-line Appendix so that the
reader may judge for himself or herself. The STRAND prototype’s performance
against this test set was as follows:

—  The  candidate  evaluation  module  identified  17  of  the  90  candidate  pairs  as 
true  translations,  and  was  correct  for  15  of  those  17,  a  precision  of  88.2%.  (A 
language-dependent filtering module with 100% correct language identification
would have eliminated one of the two false positives, giving a precision of
93.8%. However, language-dependent filtering was not used in this
evaluation.)

—  The  algorithm  identified  15  of  24  true  translation  pairs,  a  recall  of  62.5%. 
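
The figures in the two items above follow directly from the counts given in
the text, as this short Python check shows. The function name
precision_recall is hypothetical; the counts are the ones reported above.

```python
def precision_recall(accepted, correct, true_pairs):
    """Precision = correct / accepted; recall = correct / true_pairs."""
    return correct / accepted, correct / true_pairs


# 17 pairs accepted, 15 of them correct, 24 true translation pairs in the set.
precision, recall = precision_recall(accepted=17, correct=15, true_pairs=24)

# Hypothetical case from the text: perfect language identification removes
# one of the two false positives, leaving 16 accepted pairs, 15 correct.
filtered_precision, _ = precision_recall(accepted=16, correct=15, true_pairs=24)

print(f"precision {precision:.3f}, recall {recall:.3f}, "
      f"precision with filtering {filtered_precision:.4f}")
```

The three values are 15/17, 15/24, and 15/16, which correspond to the 88.2%,
62.5%, and 93.8% reported above.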

Manual  assessment  of  the  translation  pairs  retrieved  by  the  algorithm  suggests 
that  they  are  representative  of  what  one  would  expect  to  find  on  the  Web;  the 
pages  vary  widely  in  length,  content,  and  the  proportion  of  usable  parallel  natural 
language  text  in  comparison  to  markup,  graphics,  and  the  like.  However,  I  found 
the  yield  of  genuine  parallel  text  —  content  in  one  language  and  its  corresponding 
translation  in  the  other  —  to  be  encouraging.  The  reader  may  form  his  or  her 
own  judgment  by  looking  at  the  pages  identified  in  the  on-line  Appendix. 

^ A few of the 90 candidate pairs were encoded in non-HTML format, e.g. PDF
(portable document format). I excluded these from consideration a priori
because STRAND’s capabilities are currently limited to HTML.

4  Future  Work

At present it is difficult to estimate how many pairs of translated pages may
exist on the World Wide Web. However, it seems fair to say that there are a
great many, and that the number will increase as the Web continues to expand
internationally. The method for candidate generation proposed in this paper
makes it possible to quickly locate candidate pairs without building a Web
crawler, but in principle one could in fact think of the entire set of pages
on the Web as a source for candidate generation. The preliminary figures for
recall and especially for precision suggest that large parallel corpora can be
acquired from the Web with only a relatively small degree of noise, even
without human filtering. Accurate language-dependent filtering (e.g. based on
language identification, as in Section 2.3) would likely increase the
precision, reducing noise, without substantially reducing the recall of
useful, true document pairs. In addition to language-dependent filtering, the
following are some areas of investigation for future work.

—  Additional  evaluation.  As  advertised  in  the  title  of  this  paper,  the  results 
thus  far  are  preliminary.  The  STRAND  approach  needs  to  be  evaluated  with 
other  language  pairs,  on  larger  candidate  sets,  with  independent  evaluators 
being  used  in  order  to  accurately  estimate  an  upper  bound  on  the  reliability 
of  judgments  as  to  whether  a  candidate  pair  represents  a  true  translation. 
One  could  also  evaluate  how  precision  varies  with  recall,  but  I  believe  for  this 
task there are sufficiently many genuine translation pairs on the Web and
a sufficiently high recall that the focus should be on maximizing precision.
Alternative  approaches  to  candidate  generation  from  the  Web,  as  discussed 
in  Section  2.1,  are  a  topic  for  further  investigation. 

—  Scalability. The prototype, implemented in decidedly non-optimized fashion
using a combination of Perl, C, and shell scripts, currently evaluates
candidate pairs at approximately 1.8 seconds per candidate on a Sun Ultra 1
workstation with 128 megabytes of real memory, when the pages are already
resident on a disk on the local network (though not local to the workstation
itself). Thus, excluding retrieval time of pages from the Web, evaluating
1 million retrievable candidate pairs using the existing prototype would take
about 1.8 million seconds, or just under 3 weeks of real time. However, STRAND
can easily be run in parallel on an arbitrary number of machines, and the
prototype could be reimplemented to obtain significant speed-ups. The main
bottleneck to the approach, the time spent retrieving pages from the Web, is
still trivial if compared to manual construction of corpora. In real use,
STRAND would probably be run as a continuous process, constantly extending the
corpus, so that the cost of retrieval would be amortized over a long period.

—  Segment alignment. As discussed in Section 2.2, a by-product of the
candidate evaluation module in STRAND is a set of aligned text segments. The
quality of the segment-level alignment needs to be evaluated, and should be
compared against alternative alignment algorithms based on the
document-aligned collection.

—  Additional  filtering.  Although  a  primary  goal  of  this  work  is  to  obtain  a 
large,  heterogeneous  corpus,  for  some  purposes  it  may  be  useful  to  further 
filter document pairs. For example, in some applications it might be
important to restrict attention to document pairs that conform to a particular
genre or belong to a particular topic. The STRAND architecture of Figure 1
is clearly amenable to additional filtering modules such as document
classification incorporated into, or pipelined with, the language-dependent
filtering stage.

—  Dissemination.  Although  text  out  on  the  Web  is  generally  intended  for 
public  access,  it  is  nonetheless  protected  by  copyright.  Therefore  a  corpus 
collected using STRAND could not legally be distributed in any
straightforward way. However, legal constraints do not prevent multiple sites
from running their own versions of STRAND, nor any such site from distributing
a list of URLs for others to retrieve themselves. Anyone implementing this
or a related approach should be careful to observe protocols governing
automatic programs and agents on the Web.^


The final and most interesting question for future work is: What can one do
with a parallel corpus drawn from the World Wide Web? I find two possibilities
particularly promising. First, from a linguistic perspective, such a corpus
offers opportunities for comparative work in lexical semantics, potentially
providing a rich database for the cross-linguistic realization of underlying
semantic content. Second, from the perspective of applications, the corpus is
an obvious resource for acquisition of translation lexicons and
distributionally derived representations of word meaning. Most interesting of
all, each possibility is linked to many others, seemingly without end, much
like the Web itself.

Acknowledgments 

This  work  was  supported  in  part  by  DARPA/ITO  contract  N66001-97-C-8540, 
Department  of  Defense  contract  MDA90496C1250,  and  a  research  grant  from 
Sun  Microsystems  Laboratories.  I  am  grateful  to  Dan  Melamed,  Doug  Oard, 
and  David  Traum  for  useful  discussions. 

Appendix:  Experimental  Data 

At URL http://umiacs.umd.edu/~resnik/amta98/amta98_appendix.html the
interested reader can find an on-line Appendix containing the complete test
set described in Section 3, with STRAND’s classifications and the author’s
judgments.